W3: Data Wrangling with Tidy Data, Part 1

Where are we?

Illustration by Allison Horst

How’s it going?

Illustration by Allison Horst

Data Science Workflow

We start with Transform and Visualize with the assumption that our data is in a nice, “tidy” state.

Our working Tidy Data: DepMap Project

https://depmap.org/

We will work with metadata, mutation, and expression dataframes.

What do you want to do with this dataframe?

Remember a major theme of the course: How we organize ideas <-> Instructing a computer to do something.

With tidy data, we can think about how to transform our data so that it addresses our scientific question.

Subsetting a dataframe

In the dataframe you have here, which rows would you filter for, and which columns would you select, to address a scientific question?

✅ Implicit: “I want to filter for rows such that the subtype is breast cancer and look at the Age and Sex.”

🚫 Explicit: “I want to filter for rows 20-50 and select columns 2 and 8”.

Notice that when we filter for rows in an implicit way, we often formulate criteria about the columns.
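One way to see why the implicit style is preferred is to reorder the rows and repeat both approaches. The sketch below uses base R only, and the dataframe `toy_metadata` is a made-up stand-in (hypothetical IDs and values, not real DepMap records).

```r
# Hypothetical toy data for illustration only; these are not real DepMap records.
toy_metadata = data.frame(
  ModelID = c("M1", "M2", "M3", "M4"),
  OncotreeLineage = c("Breast", "Lung", "Breast", "Skin"),
  Age = c(43, 60, 69, 35)
)

# Explicit: pick rows 1 and 3 by position.
explicit_rows = toy_metadata[c(1, 3), ]

# Implicit: pick rows by a criterion on the OncotreeLineage column.
implicit_rows = toy_metadata[toy_metadata$OncotreeLineage == "Breast", ]

# Reverse the row order, then repeat both approaches.
reordered = toy_metadata[4:1, ]
explicit_again = reordered[c(1, 3), ]  # now M4 and M2: no longer the breast rows
implicit_again = reordered[reordered$OncotreeLineage == "Breast", ]  # still M3 and M1

print(explicit_again$ModelID)
print(implicit_again$ModelID)
```

The positional version silently returns different cell lines once the data is reordered, while the criterion on the column keeps returning the breast rows.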

How we do it:

library(tidyverse)

metadata_filtered = filter(metadata, OncotreeLineage == "Breast")
breast_metadata = select(metadata_filtered, ModelID, Age, Sex)

head(breast_metadata)
     ModelID Age    Sex
1 ACH-000017  43 Female
2 ACH-000019  69 Female
3 ACH-000028  69 Female
4 ACH-000044  47 Female
5 ACH-000097  63 Female
6 ACH-000111  41 Female

Here, filter() and select() are functions from the dplyr package, which is loaded as part of the tidyverse.

filter() and select()

metadata_filtered = filter(metadata, OncotreeLineage == "Breast"):

Is the second argument a logical indexing vector built from a comparison operator?

But the variable OncotreeLineage does not exist in our environment!

Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable. Inside filter(), we can refer to the column vector metadata$OncotreeLineage with just OncotreeLineage.

The input arguments for select() are also data variables.
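The data variable idea can be checked on a small example. This is a sketch on a made-up dataframe (`toy_metadata` and its values are hypothetical, not real DepMap records); it loads dplyr directly, which `library(tidyverse)` would also load.

```r
library(dplyr)  # also loaded by library(tidyverse)

# Hypothetical toy data for illustration only; not real DepMap records.
toy_metadata = data.frame(
  ModelID = c("M1", "M2", "M3"),
  OncotreeLineage = c("Breast", "Lung", "Breast"),
  Age = c(43, 60, 69)
)

# In a fresh session, no object named OncotreeLineage exists in the environment...
exists("OncotreeLineage")

# ...yet filter() finds it, because it looks the name up inside toy_metadata.
with_data_variable = filter(toy_metadata, OncotreeLineage == "Breast")

# Spelling out the column vector with $ selects the same rows.
with_dollar = filter(toy_metadata, toy_metadata$OncotreeLineage == "Breast")

# select() also takes column names as data variables.
select(with_data_variable, ModelID, Age)
```

Both calls to filter() keep the same two rows; the data variable form is simply shorter to write and read.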

Summary statistics

Now that your dataframe has been transformed based on your scientific question, you can start doing some analysis on it!

If the columns of interest are numeric, consider applying functions such as mean(), median(), or max() to a column.

If the columns of interest are character or logical, consider table().

mean(breast_metadata$Age, na.rm = TRUE)
[1] 50.96104
table(breast_metadata$Sex)

 Female Unknown 
     91       1 
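The `na.rm = TRUE` argument above matters whenever a column contains missing values. A minimal sketch with made-up vectors (`toy_age` and `toy_sex` are hypothetical, not DepMap values) shows the difference:

```r
# Hypothetical toy vectors for illustration only; not real DepMap values.
toy_age = c(43, NA, 69)
toy_sex = c("Female", "Female", "Unknown")

mean(toy_age)                  # NA, because one value is missing
mean(toy_age, na.rm = TRUE)    # 56, the mean of 43 and 69
median(toy_age, na.rm = TRUE)  # 56
max(toy_age, na.rm = TRUE)     # 69

table(toy_sex)                 # counts per category: Female 2, Unknown 1
```

Without `na.rm = TRUE`, a single missing value makes the whole summary `NA`, so it is worth checking for missing data before interpreting a result.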

Code readability with many nested functions

When combining multiple functions in one expression, it gets harder to read:

breast_metadata = select(filter(metadata, OncotreeLineage == "Breast"), ModelID, Age, Sex)

Or, this: 🤨

result2 = function1(function2(function3(dataframe)))

Or… 🤕

result = function1(function2(function3(dataframe, df_col4, df_col2), arg2), df_col5, arg1)

Pipes to make nested functions readable

result2 = dataframe %>% function3 %>% function2 %>% function1
result = dataframe %>%
         function3(df_col4, df_col2) %>%
         function2(arg2) %>%
         function1(df_col5, arg1)

Rewrite the select() and filter() function composition example above using the pipe metaphor and syntax.

breast_metadata = metadata %>% filter(OncotreeLineage == "Breast") %>%
                               select(ModelID, Age, Sex)

🤠
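To confirm that the nested and piped forms do the same work, both can be run on a small made-up dataframe and compared. In this sketch, `toy_metadata` and its values are hypothetical (not real DepMap records), and dplyr is loaded directly rather than through `library(tidyverse)`.

```r
library(dplyr)  # provides filter(), select(), and the %>% pipe

# Hypothetical toy data for illustration only; not real DepMap records.
toy_metadata = data.frame(
  ModelID = c("M1", "M2", "M3"),
  OncotreeLineage = c("Breast", "Lung", "Breast"),
  Age = c(43, 60, 69),
  Sex = c("Female", "Male", "Female")
)

# Nested: read from the inside out, so filter() runs first.
nested = select(filter(toy_metadata, OncotreeLineage == "Breast"), ModelID, Age, Sex)

# Piped: read top to bottom; each result becomes the first argument of the next call.
piped = toy_metadata %>%
  filter(OncotreeLineage == "Breast") %>%
  select(ModelID, Age, Sex)

identical(nested, piped)  # TRUE: both forms run the same two steps in the same order
```

The innermost function of the nested form becomes the first step of the pipeline, so the order of the code matches the order in which the steps happen.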